1 INTRODUCTION

Intracellular infections of epithelial cells are hypothesized to produce expression of gene pathways related to immune response, infection response, and inflammation. To test this hypothesis, Salmonella Typhimurium and the Hela-229 cell line were chosen. Salmonella are intracellular bacteria, which infect and replicate within both epithelial cells and macrophages, causing acute gastroenteritis in humans. The HeLa-229 human cell line is epithelial in origin, so will constitute an acceptable host for Salmonella.

This experiment will test this hypothesis using a DESeq2 analysis, with a two-group experimental design.

2 RESULTS

2.1 Overview

An DESeq2 analysis was conducted on data comparing gene expression in HeLa-229 cells infected with Salmonella Typhimurium with a control group. The analysis was performed with the DESeq2 tool, using the R package of the same name.

2.1.1 Data Set

Title: Transcriptomic profiling of HeLa-229 cells infected with Salmonella Typhimurium
Accession Number: SRP034009 [https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP034009]
BioProject: PRJNA231559 [https://www.ncbi.nlm.nih.gov/bioproject/PRJNA231559]
File URL [http://duffel.rail.bio/recount/v2/SRP034009/rse_gene.Rdata]

2.1.2 Experimental Design

The data set was a two-group experimental design with three replicates in each group. One group was the Salmonella infected cells, the second, the control group.

2.1.3 Analysis Process

The DESeq2 R package was used to perform the Differential Gene Expression Analysis (DGE) and the clusterProfiler R package for the Gene Set Enrichment Analysis (GSEA).

  • An initial analysis reviewed the counts, and their distribution to confirm quality and statistical criteria were satisfied.
  • A Principle Component Analysis was performed for quality assessment and exploratory analysis.
  • A Volcano Plot was generated, highlighting significant expression of genes.
  • A HeatMap was generated for the most over and under expressed genes.
  • A Gene Set Enrichment Analysis was performed to determine to most significantly expressed pathways.

2.2 Principal Component Analysis

A Principal Component Analysis (PCA) was performed on the data which showed that first 2 components explained 95.4% of the variance.

The component plots show strong differentiation on the PC1 axis for the control and condition groups, with tight clustering for the control replicates, and Salmonella replicates split into two clusters, with Salmonella Replicate 1 differing from the other 2 Salmonella replicates. The heatmap shows a different pattern of of expression for Salmonella replicate and this is explorer further in section.

2.3 DESeq2 Analysis

The DESeq2 analysis was generated and a MA plot generated, which showed decreasing variability with increased counts, and an approximately equal proportion of over and under expressed genes.

The table below shows all the significantly differentially expressed genes.

2.4 Volcano Plot

The top most significantly expressed genes included CCL2, IL6, CXCL2, CTSL and all of these are involved in inflammation or immune response pathways. Since they are differentially expressed, the presence of Salmonella is a possible cause. The underlying presence of HPV-18 in the cell line may also explain some of this expression1, but this cannot be determined with the experiment design.

Overall the plot shows conclusively that significant over and under expression occurred, with somewhat more over-expression than under-expression.

2.5 HeatMap

Evaluating the top genes together as a group using Enrichr 2 demonstrated that they are correlated with pathways such as Endogenous Toll-Like Receptor signalling, involved in infection responses and IL-17 Signalling, involved in inflammation responses. These pathways were detected in the GSEA analysis as well.

2.6 Gene Set Enrichment Analysis

A Gene Set Enrichment Analysis (GSEA) was performed, in which some pathways 3 were related to infection and immunologic specific pathways, specifically:

  • GOBP_NEGATIVE_REGULATION_OF_WOUND_HEALING
  • GOBP_REGULATION_OF_WOUND_HEALING
  • GOBP_REGULATION_OF_RESPONSE_TO_WOUNDING
  • GOBP_WOUND_HEALING

Some pathways were suggestive of tumor expression, such as GOBP_EPITHELIAL_CELL_PROLIFERATION. The presence of these may be confounding factors related to the HeLa cell line or HPV-18.

Overall, the most significant pathways had very close enrichment scores (within 15% of each other, for the top 100 pathways), with many pathways not obviously related to the experiment. Therefore, no obvious trend was revealed from the GSEA analysis, and support for the hypothesis was weak.

2.6.1 Top 5 Over and Under-Expressed Pathways

2.6.2 Top 5 GSEA Plots

The peak enrichment scores for the top 5 plots were very similar, although most of the pathways did not directly relate to the hypothesis being explored.

2.6.3 Significant Pathways

The significant pathways are shown in the table below. All have very close enrichment scores - from 2.4 to 2.06 for the first 100 pathways. So the the most significant pathways are too similarly expressed to make meaningful distinctions between them.

3 DISCUSSION

The GSEA analysis showed several pathway expression relating to inflammation, immune and bacterial infection response, along with many other pathways with similar enrichment scores. Overall, there was limited support for the hypothesis.

There was expression of some pathways potentially related to oncogenesis, which could possibly be related to the cell line. There is previous evidence to suggest highly aberrant expression of some pathways in this cell line 4. Also, the cell line contains Human Papilloma Virus 18 DNA, known to cause the majority of cervical neoplasms and also implicated in the cell line’s characteristic immortality. There is evidence of HPV-18 mRNA expression in HeLa cell lines5. These are all confounding factors which warrant further investigation.

The value of investigating individual gene expression with heatmaps or volcano plots for evaluating complex pathways appears limited, as gene expression and functional roles are heavily context dependent. The GSEA analysis was found to be slightly more effective in discovering the cellular responses to intracellular bacterial infection.

3.1 Future Lines of Enquiry

Given the nature of the cell-line used in this experiment, future experiments could be made with other, non-neoplastic epithelial cell lines, to try and eliminate the confounding problems of aberrant tumor cell expression and viral infection. Another dimension would be to use other intracellular bacteria to see how pathway expression differs across bacterial species. It is also unclear, at what stage of the infection the samples were taken, as this may affect gene expression, as well as the natural variation in infection life-cycles between individual cells.

From a methodological perspective, it is recommended to perform RNA-seq analyses with at least two methods, from toolsets such as DESeq2, EdgeR and Limma-Voom.

4 APPENDIX

4.1 Duplicate Reads in Data

A review of the DEseq data found a large number of read count duplicates (87.6% of counts has at least 1 duplicate), which gave warnings in the GSEA analysis. Duplicate reads are common in RNA-seq data sets, though the high degree of duplicate may be a cause for concern and may be caused by technical sequencing issues or a high abundance of a small number of genes.

After consideration, no action was taken to remove the duplicate values.

The table below shows duplicates of two or more counts in the raw assay data.

# assay count duplicates
assay_df <- data.frame(assay(rse_gene))
assay_df<- assay_df %>% arrange(desc(SRR1049363))
dupsCounts <- assay_df %>% group_by(SRR1049363) %>% filter(n() > 1)
head(dupsCounts, n = 12)
## # A tibble: 12 × 6
## # Groups:   SRR1049363 [6]
##    SRR1049363 SRR1049364 SRR1049365 SRR1049366 SRR1049367 SRR1049368
##         <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
##  1      54302      45609      40415      21281      23656      29219
##  2      54302      61654      61103      10225       7706       9846
##  3      53022      55343      51375      34511      34428      41473
##  4      53022      58643      56965      24197      26196      31438
##  5      50472      58848      55352      55674      52378      64884
##  6      50472      52794      53197      30063      30721      38816
##  7      49502      48087      45419      50665      45347      54503
##  8      49502      55923      59935      19846      18815      21361
##  9      46223      49472      47237      13190      14856      17526
## 10      46223      54347      51152      29966      30935      38122
## 11      44507      55588      53645      48451      48295      56803
## 12      44507      52747      53455      16894      16916      20433
duplicated_n <- nrow(dupsCounts)
duplicated_percentage <- round(duplicated_n/nrow(assay_df) * 100, 1)
print(duplicated_percentage)
## [1] 87.6

4.2 Management of Outliers

The initial DESeq2 analysis discovered a high proportion of outliers (13%). The data contains 2 groups of 3 replicates, and the DESeq analysis contains a parameter ‘minReplicatesForReplace’ with a default value of 7, which is ‘the minimum number of replicates required in order to use replaceOutliers on a sample’, so for this data set, no outlier replacement was performed.

 out of 30230 with nonzero total read count
   adjusted p-value < 0.1
   LFC > 0 (up)       : 5555, 18%
   LFC < 0 (down)     : 5050, 17%
   outliers [1]       : 3820, 13%
   low counts [2]     : 2187, 7.2%
   (mean count < 9)

A Box plot of the Cooks distances shows similar distributions of outliers for all replicates, so no single replicate stands out as a source of technical error. One replicate, SRR1049367 (Salmonella 2), did contain a lot of small distances, but this is not of significance.

4.2.1 Replacing Outliers

The DESeq2 analysis was repeated with ‘minReplicatesForReplace’ set to the minimum value of 3, which allows outliers in a single control group to be replaced.

This configuration eliminated the outliers by replacement.

summary(dds_res)
## 
## out of 25029 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 5426, 22%
## LFC < 0 (down)     : 4909, 20%
## outliers [1]       : 0, 0%
## low counts [2]     : 1456, 5.8%
## (mean count < 16)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

A review of the most extreme raw counts shows that the very high log2 fold changes are non-zero values in a single replicate group, with zero values in the other group. These values appear valid as the non-zero counts are relatively consistent between replicates, and the zero values merely represent a lower than ideal sequencing depth.

library(tibble)

raw_counts <- counts(dds)
raw_counts <- data.frame(raw_counts) %>% 
              rownames_to_column("GENEID") %>%
              mutate(GENEID=sub('\\..+', "", GENEID))

# possible outliers in this data set
dds_res_max_values <- dds_res_df %>%
  dplyr::filter(abs(log2FoldChange) > 10) %>%
  arrange(desc(log2FoldChange))

# this is raw counts
inner_join(x=raw_counts,
           y=dds_res_max_values,
           by="GENEID") %>%
           dplyr::select(GENEID, symbol, log2FoldChange, c(2:7)) %>%
           arrange(desc(log2FoldChange)) %>%
           dplyr::filter(row_number() <= TOP_PATHWAYS | row_number()> n()- TOP_PATHWAYS) %>%
           dplyr::rename("Ctl1" = "SRR1049363", "Ctl2" = "SRR1049364", "Ctl3" = "SRR1049365", 
                         "Salm1" = "SRR1049366", "Salm2" = "SRR1049367", "Salm3" = "SRR1049368")
##             GENEID       symbol log2FoldChange Ctl1 Ctl2 Ctl3 Salm1 Salm2 Salm3
## 1  ENSG00000145864       GABRB2       30.00000    0    0    0     0 75264 51774
## 2  ENSG00000057149     SERPINB3       15.97002    0    0    0  7172  7578  7139
## 3  ENSG00000171557          FGG       15.55907    0    0    0  5571  5609  5271
## 4  ENSG00000206073     SERPINB4       15.50929    0    0    0  6036  4874  4973
## 5  ENSG00000108700         CCL8       14.79386    0    0    0  3339  3081  3287
## 6  ENSG00000126838          PZP      -11.13688  605  542  858     0     0     0
## 7  ENSG00000141639        MAPK4      -11.22592  909  511  688     0     0     0
## 8  ENSG00000185518         SV2B      -11.53802  539 1113 1025     0     0     0
## 9  ENSG00000232537 RP11-385M4.1      -11.54841  721  832 1123     0     0     0
## 10 ENSG00000164120         HPGD      -11.79331  912 1502  760     0     0     0

4.3 Removal of Technical Errors

Some counts appeared in only a single replicate and were highly significant - these were considered technical artefacts that were removed. Many of these counts had duplicated normalized counts, and a sample of these is shown below.

All of these were removed from the data set.

head(sig_norm_counts)
                   SRR1049363 SRR1049364 SRR1049365 SRR1049366  SRR1049367 SRR1049368
ENSG00000145864.12          0          0          0          0 99913.64158    58646.4
ENSG00000177770.11          0          0          0          0    92.92563        0.0
ENSG00000200502.1           0          0          0          0    92.92563        0.0
ENSG00000206881.1           0          0          0          0    92.92563        0.0
ENSG00000228550.1           0          0          0          0    92.92563        0.0
ENSG00000253302.1           0          0          0          0    92.92563        0.0

4.4 Analysis of Distribution

DESeq2 assumes a negative normal distribution, so the data was checked to assess where this was a valid assumption. The dispersion plot shows that variances exceed the means, satisfying the requirement for accepting the negative binomial distribution assumption.

4.5 GSEA Plots in clusterProfiler R Package

The gseaplot function in clusterProfile has font settings which are too big for R markdown files. A defect ticket was raised in the enrichplot repository in GitHub(https://github.com/YuLab-SMU/enrichplot/issues/129) and a local wrapper function gseaplot_local which replicated the gseaplot function but added configurable font settings to enable printing multiple GSEA plots side-by-side.

The code for this is in /gseaplot_local.R.

Once this defect is addressed, the local wrapper function can be replaced by the original gseaplot function call.

4.6 Software Version Details

The details below detail the relevant system version details used to produce this analysis.

## R version 4.1.0 (2021-05-18)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] enrichplot_1.12.2           ggpubr_0.4.0                clusterProfiler_4.0.2      
##  [4] msigdbr_7.4.1               pheatmap_1.0.12             EnhancedVolcano_1.10.0     
##  [7] here_1.0.1                  DT_0.18                     stringr_1.4.0              
## [10] data.table_1.14.0           tibble_3.1.3                dplyr_1.0.7                
## [13] BiocParallel_1.26.1         gridExtra_2.3               plotly_4.9.4.1             
## [16] ggrepel_0.9.1               ggplot2_3.3.5               DESeq2_1.32.0              
## [19] SummarizedExperiment_1.22.0 Biobase_2.52.0              MatrixGenerics_1.4.0       
## [22] matrixStats_0.59.0          GenomicRanges_1.44.0        GenomeInfoDb_1.28.1        
## [25] IRanges_2.26.0              S4Vectors_0.30.0            BiocGenerics_0.38.0        
## 
## loaded via a namespace (and not attached):
##   [1] readxl_1.3.1           backports_1.2.1        shadowtext_0.0.8       fastmatch_1.1-0       
##   [5] plyr_1.8.6             igraph_1.2.6           lazyeval_0.2.2         splines_4.1.0         
##   [9] crosstalk_1.1.1        digest_0.6.27          htmltools_0.5.1.1      GOSemSim_2.18.0       
##  [13] viridis_0.6.1          GO.db_3.13.0           fansi_0.5.0            magrittr_2.0.1        
##  [17] memoise_2.0.0          openxlsx_4.2.4         Biostrings_2.60.1      annotate_1.70.0       
##  [21] graphlayouts_0.7.1     extrafont_0.17         extrafontdb_1.0        colorspace_2.0-2      
##  [25] blob_1.2.1             haven_2.4.1            xfun_0.24              crayon_1.4.1          
##  [29] RCurl_1.98-1.3         jsonlite_1.7.2         scatterpie_0.1.6       genefilter_1.74.0     
##  [33] survival_3.2-11        ape_5.5                glue_1.4.2             polyclip_1.10-0       
##  [37] gtable_0.3.0           zlibbioc_1.38.0        XVector_0.32.0         DelayedArray_0.18.0   
##  [41] proj4_1.0-10.1         car_3.0-11             Rttf2pt1_1.3.8         maps_3.3.0            
##  [45] abind_1.4-5            scales_1.1.1           DOSE_3.18.1            DBI_1.1.1             
##  [49] rstatix_0.7.0          Rcpp_1.0.7             viridisLite_0.4.0      xtable_1.8-4          
##  [53] tidytree_0.3.4         foreign_0.8-81         bit_4.0.4              htmlwidgets_1.5.3     
##  [57] httr_1.4.2             fgsea_1.18.0           RColorBrewer_1.1-2     ellipsis_0.3.2        
##  [61] pkgconfig_2.0.3        XML_3.99-0.6           farver_2.1.0           locfit_1.5-9.4        
##  [65] utf8_1.2.1             tidyselect_1.1.1       labeling_0.4.2         rlang_0.4.11          
##  [69] reshape2_1.4.4         AnnotationDbi_1.54.1   cellranger_1.1.0       munsell_0.5.0         
##  [73] tools_4.1.0            cachem_1.0.5           cli_3.0.0              downloader_0.4        
##  [77] generics_0.1.0         RSQLite_2.2.7          broom_0.7.8            evaluate_0.14         
##  [81] fastmap_1.1.0          yaml_2.2.1             ggtree_3.0.2           babelgene_21.4        
##  [85] knitr_1.33             bit64_4.0.5            tidygraph_1.2.0        zip_2.2.0             
##  [89] purrr_0.3.4            KEGGREST_1.32.0        ggraph_2.0.5           nlme_3.1-152          
##  [93] ash_1.0-15             ggrastr_0.2.3          aplot_0.0.6            DO.db_2.9             
##  [97] rstudioapi_0.13        compiler_4.1.0         curl_4.3.2             beeswarm_0.4.0        
## [101] png_0.1-7              ggsignif_0.6.2         treeio_1.16.1          tweenr_1.0.2          
## [105] geneplotter_1.70.0     stringi_1.7.3          highr_0.9              ggalt_0.4.0           
## [109] forcats_0.5.1          lattice_0.20-44        Matrix_1.3-4           vctrs_0.3.8           
## [113] pillar_1.6.1           lifecycle_1.0.0        BiocManager_1.30.16    cowplot_1.1.1         
## [117] bitops_1.0-7           patchwork_1.1.1        qvalue_2.24.0          R6_2.5.0              
## [121] rio_0.5.27             KernSmooth_2.23-20     vipor_0.4.5            MASS_7.3-54           
## [125] assertthat_0.2.1       rprojroot_2.0.2        withr_2.4.2            GenomeInfoDbData_1.2.6
## [129] hms_1.1.0              grid_4.1.0             tidyr_1.1.3            rmarkdown_2.9         
## [133] rvcheck_0.1.8          carData_3.0-4          ggforce_0.3.3          ggbeeswarm_0.6.0

4.7 References